Abstract

We consider the problem of online estimation of a real-valued signal 
corrupted by oblivious zero-mean noise using linear estimators. The esti- 
mator is required to iteratively predict the underlying signal based on the 
current and several last noisy observations, and its performance is measured 
by the mean-square-error. We describe and analyze an algorithm for this 
task which: 

1. Achieves logarithmic adaptive regret against the best linear filter in
hindsight. This bound is asymptotically tight, and resolves the question
of Moon and Weissman [1].

2. Runs in linear time in terms of the number of filter coefficients. Previ- 
ous constructions required at least quadratic time. 



1 Introduction 

We consider the problem of filtering: designing algorithms for the causal estima- 
tion of a real valued signal from noisy observations. The filtering algorithm ob- 
serves at each iteration a noisy signal component, and is required to estimate the 
corresponding underlying signal component based on the current and past noisy 
observations alone. 

We consider finite fixed-length linear filters that combine the current and several 
last noisy observations for prediction of the current underlying signal component. 
Performance is measured by the mean square error over the entire signal. Following
the setting in [1], we assume that the underlying signal is an arbitrary bounded
signal, possibly even adversarial, and that it is corrupted by additive zero-mean,
time-independent, bounded noise with known constant variance $\sigma^2$.¹
The approach taken in this paper is to construct a universal filter - i.e. an adaptive 
filter whose performance we compare to an optimal offline filter with full knowl- 
edge of the signal and noise. The metric of performance is thus regret - or the 
difference between the total mean squared error incurred by our adaptive filter, 
and the total mean square error of the offline benchmark filter. 
The question of competing with a fixed offline filter was successfully tackled in
[1]. In this paper we consider a more challenging task: competing with the best
offline changing filter, where restrictions are placed on how often this optimal
offline filter is allowed to change. A more stringent metric of performance that
fully captures this notion of competing with an adaptive offline benchmark is
called adaptive regret: it is the maximum regret incurred by the algorithm on
any subinterval.

We present and analyze simple, efficient and intuitive algorithms that attain
logarithmic adaptive regret. This bound is tight, and resolves a question posed by
Moon and Weissman in [1]. Along the way, we introduce a simple universal
algorithm for filtering, improving the previously known best running time from 
quadratic in the number of filter coefficients to linear. 

1.1 Related Work 

Over the years there has been much work on the problem of estimating a real-valued
signal from noisy observations with respect to the MMSE loss. Classical results
assume a model in which the underlying signal is stochastic with some known
parameters, i.e. the first and second moments, or require the signal to be stationary,
such as the classical work of [2]. The special case of linear MMSE filters has
received particular attention due to its simplicity [3]. For more recent results on
MMSE estimation see [4, 5, 6, 7].
In this work we follow the non-stochastic setting of [1]: no generating model is
assumed for the underlying signal, and stochastic assumptions are made only on the
added noise (that it is zero-mean and time-independent with known fixed variance).
In this setting, while considering finite linear filters, [1] presented an online
algorithm that achieves logarithmic expected regret with respect to the entire signal.
The computational complexity of their algorithm is quadratic in
the linear filter size.

We build on recent results from the emerging online learning framework
called online convex optimization [8, 9]. For our adaptive regret algorithm,
we use tools from the framework presented in [10] to derive an algorithm that
achieves logarithmic expected regret on any interval of the signal.

¹ The justification of [1] for assuming that the variance is a known constant is that this variance
could be learned by sending a training sequence in the beginning of transmission.

2 Preliminaries 

2.1 Online convex optimization 

In the setting of online convex optimization (OCO) an online algorithm $A$ is
iteratively required to make a prediction by choosing a point $x_t$ in some convex
set $\mathcal{K}$. The algorithm then incurs a loss $l_t(x_t)$, where $l_t(x) : \mathcal{K} \to \mathbb{R}$ is a convex
function. The emphasis in this model is that on iteration $t$, $A$ knows only
the loss functions of previous iterations $l_1(x), \ldots, l_{t-1}(x)$, and thus $l_t(x)$ may be
chosen arbitrarily and even adversarially. The standard goal in this setting is to minimize
the difference between the overall loss of $A$ and that of the best fixed point
$x^* \in \mathcal{K}$ in hindsight. This difference is called regret and it is formally given by,

$$R_T(A) = \sum_{t=1}^{T} l_t(x_t) - \min_{x \in \mathcal{K}} \sum_{t=1}^{T} l_t(x)$$

A stronger measure of performance requires the algorithm to have little regret
on any interval $I = [r, s] \subseteq [T]$ with respect to the best fixed point $x_I^* \in \mathcal{K}$ in
hindsight in this interval. This measure is called adaptive regret and it is given by,

$$AR_T(A) = \sup_{I=[r,s] \subseteq [T]} \left\{ \sum_{t=r}^{s} l_t(x_t) - \min_{x \in \mathcal{K}} \sum_{t=r}^{s} l_t(x) \right\}$$
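To make the two definitions concrete, the following is a minimal sketch that evaluates regret and adaptive regret by brute force. It assumes scalar quadratic losses $l_t(x) = (x - a_t)^2$ over $\mathcal{K} = \mathbb{R}$, so that the best fixed point on an interval is the mean of the targets; the function names and the toy data are illustrative and not part of the paper.

```python
def regret(plays, targets, r, s):
    """Regret on the interval [r, s] (0-based, inclusive) for l_t(x) = (x - a_t)^2."""
    seg_plays = plays[r:s + 1]
    seg_targets = targets[r:s + 1]
    incurred = sum((x - a) ** 2 for x, a in zip(seg_plays, seg_targets))
    # The minimizer of a sum of squared distances is the mean of the targets.
    best = sum(seg_targets) / len(seg_targets)
    best_loss = sum((best - a) ** 2 for a in seg_targets)
    return incurred - best_loss


def adaptive_regret(plays, targets):
    """Maximum regret over all intervals [r, s], computed by brute force."""
    T = len(plays)
    return max(regret(plays, targets, r, s)
               for r in range(T) for s in range(r, T))


# A signal that switches level halfway, and an algorithm that always plays 0.5.
targets = [0.0] * 5 + [1.0] * 5
plays = [0.5] * 10
full_regret = regret(plays, targets, 0, 9)
worst_interval_regret = adaptive_regret(plays, targets)
```

In this toy instance the constant play is optimal over the whole horizon, yet it is far from optimal on each half, which is exactly the gap that adaptive regret is designed to expose.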

2.2 Problem Setting 

Let $x_t$ be a real-valued, possibly adversarial, signal bounded in the range $[-B_X, B_X]$.
The signal $x_t$ is corrupted by an additive zero-mean, time-independent noise $n_t$
bounded in the range $[-B_N, B_N]$ with known time-invariant variance $\sigma^2$. An
estimator observes at time $t$ the noisy signal $y_t = x_t + n_t$, and is required to
predict $x_t$ by taking a linear combination of the observations $y_t, y_{t-1}, \ldots, y_{t-d+1}$,
where $d$ is the order of the filter. That is, the estimator chooses at time $t$ a filter
$w_t \in \mathbb{R}^d$ and predicts according to $w_t^\top Y_t$, where $Y_t \in \mathbb{R}^d$ and $Y_t(i) = y_{t-i+1}$,
$1 \le i \le d$. The loss of the estimator after $T$ iterations is given by the mean-square-error
$\frac{1}{T}\sum_{t=1}^{T} (x_t - w_t^\top Y_t)^2$.
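The setting can be sketched in a few lines of code. The snippet below builds the observation vector $Y_t$ and evaluates the mean-square-error of a fixed filter on a synthetic signal. It is only an illustration: the zero padding of observations before the first sample, the square-wave signal, and the uniform noise are assumptions made for the example.

```python
import random


def observation_vector(y, t, d):
    """Y_t = (y_t, y_{t-1}, ..., y_{t-d+1}) for 0-based t.

    Observations before the start of the signal are taken to be 0 (an assumption).
    """
    return [y[t - i] if t - i >= 0 else 0.0 for i in range(d)]


def filter_mse(w, x, y):
    """Mean-square-error of the fixed linear filter w against the clean signal x."""
    d = len(w)
    total = 0.0
    for t in range(len(x)):
        Y = observation_vector(y, t, d)
        prediction = sum(wi * yi for wi, yi in zip(w, Y))
        total += (x[t] - prediction) ** 2
    return total / len(x)


rng = random.Random(0)
x = [1.0 if (t // 20) % 2 == 0 else -1.0 for t in range(200)]  # bounded signal
noise = [rng.uniform(-0.5, 0.5) for _ in x]                    # zero-mean, bounded
y = [xt + nt for xt, nt in zip(x, noise)]

# The filter (1, 0, 0) simply outputs y_t, so its error is the empirical noise power.
identity_mse = filter_mse([1.0, 0.0, 0.0], x, y)
```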

If $x_t$ were observable to the online algorithm, minimizing the regret and the
adaptive regret would be fairly easy using the framework of OCO with the loss functions
$l_t(w_t) = (x_t - w_t^\top Y_t)^2$. However in our case, the algorithm only observes the
noisy signal $y_t$ and thus online convex optimization algorithms cannot be directly
used. Denoting by $\hat{l}_t(w) = (y_t - w^\top Y_t)^2 + 2w^\top c$ the estimated loss, where $c \in \mathbb{R}^d$, $c = (\sigma^2, 0, \ldots, 0)$,
it was pointed out in [1] that if $w_t$ depends only on the observations $y_1, \ldots, y_{t-1}$,
then for any $w \in \mathbb{R}^d$ it holds that,

$$E\left[\sum_{t=1}^{T} l_t(w_t) - \sum_{t=1}^{T} l_t(w)\right] = E\left[\sum_{t=1}^{T} \hat{l}_t(w_t) - \sum_{t=1}^{T} \hat{l}_t(w)\right] \qquad (1)$$



Thus by using OCO algorithms with the estimated loss functions $\hat{l}_t(w)$ we may
minimize the expected regret with respect to the actual losses $l_t(w)$. In particular, a simple
algorithm such as [8] immediately gives an $O(\sqrt{T})$ bound on the expected regret
as well as on the expected adaptive regret with respect to the true losses $l_t(w)$, as
long as we limit the choice of the filter to a Euclidean ball of constant radius.
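The identity (1) can be checked term by term. The following short derivation is a sketch under the stated assumptions: $\hat{l}_t$ denotes the estimated loss $(y_t - w^\top Y_t)^2 + 2w^\top c$ (the hat is our notation), $w(1)$ is the first coefficient of $w$, and $w$ is independent of $n_t$ because it depends only on $y_1, \ldots, y_{t-1}$.

```latex
\begin{align*}
\hat{l}_t(w) &= (x_t + n_t - w^\top Y_t)^2 + 2\sigma^2 w(1) \\
             &= l_t(w) + 2 n_t (x_t - w^\top Y_t) + n_t^2 + 2\sigma^2 w(1), \\
E[n_t x_t] &= 0, \qquad
E[n_t\, w^\top Y_t] = E[w(1)]\, E[n_t y_t] = \sigma^2 E[w(1)], \\
E[\hat{l}_t(w)] &= E[l_t(w)] - 2\sigma^2 E[w(1)] + \sigma^2 + 2\sigma^2 E[w(1)]
                 = E[l_t(w)] + \sigma^2 .
\end{align*}
```

The additive constant $\sigma^2$ is the same for $w_t$ and for any fixed comparator $w$, so it cancels in the difference of the two sums, which gives (1). Only the coordinate $y_t$ of $Y_t$ is correlated with $n_t$, which is why the correction vector $c$ has a single nonzero entry.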



2.3 Using Strong-Convexity and Exp-Concavity

Given a function $f(x) : \mathcal{K} \to \mathbb{R}$ we denote by $\nabla f(x)$ the gradient vector of $f$
at point $x$ and by $\nabla^2 f(x)$ the matrix of second derivatives, also known as the
Hessian, of $f$ at point $x$. $f(x)$ is convex at point $x$ if and only if $\nabla^2 f(x) \succeq 0$, that
is its Hessian is positive semidefinite at $x$.

We say that $f$ is $H$-strongly-convex, for some $H > 0$, if for all $x \in \mathcal{K}$ it holds that
$\nabla^2 f(x) \succeq HI$, where $I$ is the identity matrix of proper dimension. That is, all the
eigenvalues of $\nabla^2 f(x)$ are lower bounded by $H$ for all $x \in \mathcal{K}$.

We say that $f$ is $\alpha$-exp-concave, for some $\alpha > 0$, if the function $\exp(-\alpha f(x))$ is
a concave function of $x \in \mathcal{K}$. It is easy to show that given a function $f$ such that
$\nabla^2 f \succeq HI$ and $\max_{x \in \mathcal{K}} \|\nabla f(x)\|_2 \le G$ it holds that $f$ is $\frac{H}{G^2}$-exp-concave.

In case all loss functions are $H$-strongly-convex or $\alpha$-exp-concave, there exist
algorithms that achieve logarithmic regret and adaptive regret [9, 10].

In our case, the Hessian of the loss function $l_t(w)$ is given by the random matrix
$\nabla^2 l_t(w) = 2Y_tY_t^\top$, which is positive semidefinite. Writing $Y_t = X_t + N_t$, where $X_t$ and
$N_t$ are the vectors of the last $d$ signal and noise components respectively, it holds that

$$E[Y_tY_t^\top] = E[X_tX_t^\top + N_tX_t^\top + X_tN_t^\top + N_tN_t^\top] = X_tX_t^\top + \sigma^2 I \qquad (2)$$

Nevertheless, in the worst case, $l_t(w)$ need not be strongly-convex or exp-concave
and thus algorithms such as [9, 10] cannot be directly used in order to get
logarithmic expected regret and adaptive regret.
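The step from the middle expression to the right-hand side of (2) uses only the noise assumptions of Section 2.2. The following is a sketch of that step, where $X_t = (x_t, \ldots, x_{t-d+1})$ and $N_t = (n_t, \ldots, n_{t-d+1})$ are names introduced here for the signal and noise parts of $Y_t$.

```latex
\begin{align*}
E[N_t X_t^\top] &= E[N_t]\, X_t^\top = 0, \qquad E[X_t N_t^\top] = X_t\, E[N_t]^\top = 0, \\
E[N_t N_t^\top]_{ij} &= E[n_{t-i+1}\, n_{t-j+1}] =
\begin{cases} \sigma^2 & i = j, \\ 0 & i \neq j, \end{cases}
\qquad\text{so}\qquad E[N_t N_t^\top] = \sigma^2 I .
\end{align*}
```

The cross terms vanish because the signal is fixed (oblivious to the noise) and the noise is zero-mean, and the off-diagonal entries vanish because noise samples at different times are independent. The expected Hessian is therefore bounded below by $2\sigma^2 I$ even though each realization $2Y_tY_t^\top$ has rank one.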



3 A Simple Gradient Descent Filter

In this section we describe how the problem of the loss functions $l_t$ not necessarily
being strongly-convex or exp-concave can be overcome, and introduce a simple
gradient descent algorithm based on [9] that achieves $O(\log T)$ expected regret.
For time $t$ and filter $w \in \mathbb{R}^d$ we define the following loss functions.



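$$L^k_t(w) = \sum_{\tau=t-k+1}^{t} \hat{l}_\tau(w) + (w - w_t)^\top \left( (k-d+1)\sigma^2 I - \sum_{\tau=t-k+d}^{t} Y_\tau Y_\tau^\top \right) (w - w_t) \qquad (3)$$

where $\hat{l}_\tau$ is the estimated loss function of Section 2.2, $w_t$ is the filter that was used
by the algorithm for prediction at time $t$, and $k \in \mathbb{N}^+$ is a parameter.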

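Our gradient descent filtering algorithm is given below.

Algorithm 1 GDFilter
Input: $k \in \mathbb{N}^+$, $H \in \mathbb{R}^+$, $R \in \mathbb{R}^+$
Let $w_1 = 0_d$
for $c = 1, 2, \ldots$ do
    for $t = (c-1)k + 1, \ldots, ck$ do
        predict: $\hat{x}_t = w_c^\top Y_t$.
    end for
    $\tilde{w}_{c+1} \leftarrow w_c - \eta_{c+1} \nabla L^k_{ck}(w_c)$, where $\eta_{c+1} = \frac{1}{H(c+1)}$
    if $\|\tilde{w}_{c+1}\| > R$ then
        $w_{c+1} \leftarrow \tilde{w}_{c+1} \cdot \frac{R}{\|\tilde{w}_{c+1}\|}$
    else
        $w_{c+1} \leftarrow \tilde{w}_{c+1}$
    end if
end for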
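The block structure of the algorithm can be sketched as follows. This is an illustrative implementation under explicit assumptions, not the authors' code: the step size is taken as $1/(H(c+1))$, observations before the first sample are treated as 0, a final incomplete block is processed like a full one, and the function and variable names are ours. At the point $w = w_c$ the regularization term of $L^k_{ck}$ has zero gradient, so the update only needs the summed gradients of the estimated losses $(y_t - w^\top Y_t)^2 + 2w^\top c$ over the block.

```python
import math
import random


def obs_vector(y, t, d):
    # Y_t = (y_t, ..., y_{t-d+1}); samples before the start are taken as 0 (assumption).
    return [y[t - i] if t - i >= 0 else 0.0 for i in range(d)]


def gd_filter(y, d, sigma2, R):
    """Blockwise projected gradient descent on the estimated losses."""
    k, H = 2 * d, d * sigma2           # parameter choices used in Theorem 1
    w = [0.0] * d                      # w_1 = 0_d
    preds = []
    for c, start in enumerate(range(0, len(y), k), start=1):
        grad = [0.0] * d
        for t in range(start, min(start + k, len(y))):
            Y = obs_vector(y, t, d)
            pred = sum(wi * yi for wi, yi in zip(w, Y))
            preds.append(pred)         # the same filter is used on the whole block
            resid = y[t] - pred
            for i in range(d):
                grad[i] -= 2.0 * resid * Y[i]
            grad[0] += 2.0 * sigma2    # gradient of 2 w^T c with c = (sigma^2, 0, ..., 0)
        eta = 1.0 / (H * (c + 1))      # assumed step size 1 / (H (c + 1))
        w = [wi - eta * gi for wi, gi in zip(w, grad)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > R:                   # project back onto the ball of radius R
            w = [wi * R / norm for wi in w]
    return preds, w


rng = random.Random(0)
x = [math.sin(0.05 * t) for t in range(400)]
y = [xt + rng.uniform(-0.5, 0.5) for xt in x]   # uniform noise has variance 1/12
preds, w_final = gd_filter(y, d=3, sigma2=1.0 / 12.0, R=2.0)
```

Each block costs $O(kd)$ arithmetic operations, that is $O(d)$ per sample, which is the linear running time in the number of filter coefficients claimed in the abstract.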



We have the following theorem and corollary.

Theorem 1. Let $w_t$ be the filter used by Algorithm 1 for prediction at time $t$. Let
$k = 2d$ and $H = d\sigma^2$. Algorithm 1 achieves the following regret bound,

$$E\left[\sum_{t=1}^{T} l_t(w_t)\right] - \min_{w \in \mathbb{R}^d, \|w\| \le R} E\left[\sum_{t=1}^{T} l_t(w)\right] = O\left(\frac{d^3R^2(B_X + B_N)^4}{\sigma^2}\log T\right)$$

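Corollary 1. Let $w_t$ be the filter used by Algorithm 1 for prediction at time $t$. Let
$k = 2d$, $H = d\sigma^2$ and let $R = \frac{\sqrt{d}B_X^2}{\sigma^2}$. It holds that,

$$E\left[\sum_{t=1}^{T} l_t(w_t)\right] - \min_{w \in \mathbb{R}^d} E\left[\sum_{t=1}^{T} l_t(w)\right] = O\left(\frac{d^4B_X^4(B_X + B_N)^4}{\sigma^6}\log T\right)$$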
Basically, the new loss function (3) sums several consecutive losses and adds
a regularization expression. We show that since the regularization expression
depends on the actual choices of the filtering algorithm, achieving low regret with
respect to $L^k_t(w)$ implies low regret with respect to the losses $l_t(w)$. Moreover, as
we will show, the combination of summing several losses and adding regularization
ensures that $L^k_t(w)$ is always strongly-convex for a proper choice of $k$, and
thus we can use the algorithms in [9, 10] to get logarithmic regret.
It holds that,



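$$\nabla^2 L^k_t(w) = \sum_{\tau=t-k+1}^{t} \nabla^2 \hat{l}_\tau(w) + 2\left((k-d+1)\sigma^2 I - \sum_{\tau=t-k+d}^{t} Y_\tau Y_\tau^\top\right)$$
$$= 2\sum_{\tau=t-k+1}^{t} Y_\tau Y_\tau^\top + 2(k-d+1)\sigma^2 I - 2\sum_{\tau=t-k+d}^{t} Y_\tau Y_\tau^\top$$
$$\succeq 2(k-d+1)\sigma^2 I \qquad (4)$$

Thus for $k \ge d$, $L^k_t(w)$ is always $2(k-d+1)\sigma^2$-strongly-convex and
$\frac{2(k-d+1)\sigma^2}{G^2}$-exp-concave, where $G = \max_{w,t} \|\nabla L^k_t(w)\|$.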



We thus use the gradient descent algorithm in [9] by partitioning the iterations
into disjoint blocks of length $k$ each. Our algorithm updates its filter every $k$
iterations according to the loss function $L^k_t(w)$ for $t = ck$, $c \in \mathbb{Z}$, and predicts
using the same filter on all iterations in the same block. The value of $k$ is assumed
to be a constant independent of $T$.

Abusing notation, we use $L^k_{ck}(w)$ and $L_c(w)$ interchangeably, where
$L_c(w)$ refers to the loss on block number $c$ of length $k$.
The following Lemma plays a key part in our analysis. 

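Lemma 1. Let $A$ be a filtering algorithm that updates its filter every $k$ iterations.
Denote by $w_t$ the filter used for prediction on iteration $t$ and denote by $w_c$ the filter
used to predict on the entire block $c$, that is on iterations $(c-1)k + 1, \ldots, ck$.
It holds for any fixed $w \in \mathbb{R}^d$ that

$$E\left[\sum_{t=1}^{T} l_t(w_t) - \sum_{t=1}^{T} l_t(w)\right] \le E\left[\sum_{c=1}^{T/k} L^k_{ck}(w_c) - \sum_{c=1}^{T/k} L^k_{ck}(w)\right]$$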



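Proof. First we assume w.l.o.g. that $T = b \cdot k$ for some $b \in \mathbb{N}^+$. Otherwise it
holds that $T = b \cdot k + a$ where $0 < a < k$ and thus the regret on the additional $a$
iterations is a constant independent of $T$ and we can ignore it in the regret bound.
Let $\hat{l}_\tau(w) = (y_\tau - w^\top Y_\tau)^2 + 2w^\top c$ denote the estimated loss. Since the
regularization term of $L^k_{ck}$ vanishes at $w = w_c$, we now have,

$$\sum_{c=1}^{T/k} L^k_{ck}(w_c) - \sum_{c=1}^{T/k} L^k_{ck}(w) \qquad (5)$$
$$= \sum_{c=1}^{T/k} \sum_{\tau=(c-1)k+1}^{ck} \hat{l}_\tau(w_c) - \sum_{c=1}^{T/k} \left( \sum_{\tau=(c-1)k+1}^{ck} \hat{l}_\tau(w) + (w - w_c)^\top \Big( (k-d+1)\sigma^2 I - \sum_{\tau=ck-k+d}^{ck} Y_\tau Y_\tau^\top \Big)(w - w_c) \right)$$
$$= \sum_{t=1}^{T} \big( \hat{l}_t(w_t) - \hat{l}_t(w) \big) - \sum_{c=1}^{T/k} (w - w_c)^\top \Big( (k-d+1)\sigma^2 I - \sum_{\tau=(c-1)k+d}^{ck} Y_\tau Y_\tau^\top \Big)(w - w_c)$$

Since $A$ updates its filter every $k$ iterations, we have that $w_c$ depends only on the
random variables $n_1, \ldots, n_{(c-1)k}$, whereas $Y_\tau$ for $\tau \ge (c-1)k + d$ depends only on
noise samples of block $c$. Thus, writing $A \circ B = \sum_{i,j} A_{ij}B_{ij}$ and using (2), we have for all $c$ that,

$$E\left[ (w - w_c)^\top \Big( (k-d+1)\sigma^2 I - \sum_{\tau=(c-1)k+d}^{ck} Y_\tau Y_\tau^\top \Big)(w - w_c) \right]$$
$$= (k-d+1)\sigma^2 E\big[\|w - w_c\|^2\big] - E\Big[ \sum_{\tau=(c-1)k+d}^{ck} Y_\tau Y_\tau^\top \Big] \circ E\big[ (w - w_c)(w - w_c)^\top \big]$$
$$= (k-d+1)\sigma^2 E\big[\|w - w_c\|^2\big] - \Big( \sum_{\tau=(c-1)k+d}^{ck} X_\tau X_\tau^\top + (k-d+1)\sigma^2 I \Big) \circ E\big[ (w - w_c)(w - w_c)^\top \big]$$
$$= - \sum_{\tau=(c-1)k+d}^{ck} X_\tau X_\tau^\top \circ E\big[ (w - w_c)(w - w_c)^\top \big] \le 0$$

Overall, by taking expectation over (5) we get

$$E\left[ \sum_{c=1}^{T/k} L^k_{ck}(w_c) - \sum_{c=1}^{T/k} L^k_{ck}(w) \right] \ge E\left[ \sum_{t=1}^{T} \hat{l}_t(w_t) - \hat{l}_t(w) \right]$$

The lemma now follows from (1). □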



According to Lemma 1 we can reduce our discussion to algorithms that predict
in disjoint blocks of length $k$ and achieve low regret with respect to the loss
function $L^k_t(w)$.

In order to derive precise regret bounds we give a bound on $G = \max_{w,t} \|\nabla L^k_t(w)\|$.
With $c = (\sigma^2, 0, \ldots, 0)$ as in Section 2.2,

$$\nabla L^k_t(w) = -2\sum_{\tau=t-k+1}^{t} Y_\tau (y_\tau - w^\top Y_\tau) + 2k\,c + 2\Big( (k-d+1)\sigma^2 I - \sum_{\tau=t-k+d}^{t} Y_\tau Y_\tau^\top \Big)(w - w_t)$$

Thus by simple algebra we have,

$$G^2 = O\big( k^2 d(B_X + B_N)^2 R^2 d(B_X + B_N)^2 + k^2 d^2 (B_X + B_N)^4 R^2 \big) = O\big( k^2 d^2 R^2 (B_X + B_N)^4 \big)$$

where $R$ is a bound on the magnitude of the filter. That is, we consider only filters
$w \in \mathbb{R}^d$ such that $\|w\|_2 \le R$. $R$ needs to be bounded since the regret of online
convex optimization algorithms grows with $G$.
As pointed out in [1], for

$$w^* = \arg\min_{w \in \mathbb{R}^d} E\left[ \sum_{t=1}^{T} l_t(w) \right]$$

it holds that $\|w^*\| \le \frac{\sqrt{d}B_X^2}{\sigma^2}$.

We denote by $G(k, R)$ an upper bound on $\max_{w,t} \|\nabla L^k_t(w)\|$ parametrized by
$k, R$.

For the complete proof of the theorem and corollary the reader is referred to the
appendix.
4 An Adaptive Algorithm

In this section we present an algorithm that is based on the framework from [10]
and achieves logarithmic expected regret on any interval $I = [r, s] \subseteq [T]$. Our
algorithm is given below. We have the following theorem and corollary.

Theorem 2. Let $w_t$ be the filter used by Algorithm 2 for prediction at time $t$. Let
$k = 2d$ and let $\alpha = \frac{[\ldots]}{G(2d,R)^2}$. For all $I = [r, s] \subseteq [T]$, Algorithm 2 achieves the
following regret bound,

$$E\left[\sum_{t=r}^{s} l_t(w_t)\right] - \min_{w \in \mathbb{R}^d, \|w\| \le R} E\left[\sum_{t=r}^{s} l_t(w)\right] = O\big([\ldots]\log T\big)$$
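Algorithm 2 AdaptiveFilter

1: Input: $k \in \mathbb{N}^+$, $\alpha \in \mathbb{R}^+$.
2: Let $E^1, \ldots, E^T$ be online convex optimization algorithms.
3: Let $p_1 \in \mathbb{R}^T$, $p_1^{(1)} = 1$, $\forall j : 1 < j \le T$, $p_1^{(j)} = 0$.
4: for $c = 1, 2, \ldots$ do
5:     $\forall j \le c$, $w_c^{(j)} \leftarrow E^j(L_{c-1})$ (the filter of the $j$'th algorithm).
6:     $w_c \leftarrow \sum_{j=1}^{c} p_c^{(j)} w_c^{(j)}$.
7:     for $t = (c-1)k + 1, \ldots, ck$ do
8:         predict: $\hat{x}_t = w_c^\top Y_t$.
9:     end for
10:    $\hat{p}_{c+1}^{(c+1)} = 0$ and for $i \in [c]$, $\hat{p}_{c+1}^{(i)} = \frac{p_c^{(i)} e^{-\alpha L_c(w_c^{(i)})}}{\sum_{j=1}^{c} p_c^{(j)} e^{-\alpha L_c(w_c^{(j)})}}$.
11:    $p_{c+1}^{(c+1)} = 1/(c+1)$ and for $i \in [c]$: $p_{c+1}^{(i)} = (1 - (c+1)^{-1})\hat{p}_{c+1}^{(i)}$ (adding expert $E^{c+1}$).
12: end for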
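Steps 10 and 11 are the only place where the expert weights change, and they can be sketched in isolation. The function below is an illustrative version of that update, assuming the block losses of the experts have already been computed; the function name, the argument layout, and the subtraction of the minimum loss (a standard numerical-stability device that cancels in the normalization) are ours.

```python
import math


def update_weights(p, losses, alpha):
    """One application of steps 10-11.

    p      -- weights p_c^(1..c) of the c active experts (they sum to 1)
    losses -- block losses L_c(w_c^(i)) of the same experts
    Returns the c + 1 weights p_{c+1}^(1..c+1), the last one for the new expert.
    """
    c = len(p)
    shift = min(losses)  # cancels in the ratio below, avoids underflow in exp
    unnormalized = [pi * math.exp(-alpha * (Li - shift)) for pi, Li in zip(p, losses)]
    z = sum(unnormalized)
    p_hat = [u / z for u in unnormalized]             # step 10: exponential reweighting
    new_p = [(1.0 - 1.0 / (c + 1)) * ph for ph in p_hat]
    new_p.append(1.0 / (c + 1))                       # step 11: mix in expert c + 1
    return new_p


p2 = update_weights([1.0], [3.0], alpha=0.5)
p3 = update_weights(p2, [1.0, 2.0], alpha=0.5)
```

The fixed share $1/(c+1)$ given to the newest expert is what lets the combined filter track an expert that was started at the beginning of any interval, at the cost of the $\ln r$ term that appears in the analysis.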



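Corollary 2. Let $w_t$ be the filter used by Algorithm 2 for prediction at time $t$. Let
$k = 2d$, $R = \frac{\sqrt{d}B_X^2}{\sigma^2}$ and let $\alpha = \frac{[\ldots]}{G(2d,R)^2}$. For all $I = [r, s] \subseteq [T]$, Algorithm 2
achieves the following regret bound,

$$E\left[\sum_{t=r}^{s} l_t(w_t)\right] - \min_{w \in \mathbb{R}^d} E\left[\sum_{t=r}^{s} l_t(w)\right] = O\big([\ldots]\log T\big)$$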



As in the previous section, we take the approach of partitioning the iterations
into disjoint blocks of length $k$ and optimizing over the loss functions $L_c$.
The algorithm is based on the well-known experts framework, where each expert,
in our case, is the gradient descent filter presented in the previous section. On
each block $c$, the algorithm adds a new expert that starts producing predictions
from block $c + 1$ onward. The experts algorithm predicts on each iteration by
combining the filters of all experts using a weighted sum according to the weight
of each expert. The key idea behind this framework is that an expert added at
block $r$ achieves low regret on all intervals starting in $r$. Given such an interval,
the experts algorithm itself achieves low regret on the interval with respect to this
specific expert, and thus has low regret on the interval.

Expert $E^r$ can be thought of as an algorithm that plays $w_c = 0_d$ for all $c < r$ and,
starting at block $r$, plays according to Algorithm 1.
For the complete proof of the theorem and corollary the reader is referred to the
appendix.
A Proof of Theorems 1, 2

The proofs are based on [9, 10] and are brought here in full detail for completeness.

Theorem 3. Let $w_t$ be the filter used by Algorithm 1 for prediction at time $t$. Let
$k = 2d$ and $H = d\sigma^2$. Algorithm 1 achieves the following regret bound,

$$E\left[\sum_{t=1}^{T} l_t(w_t)\right] - \min_{w \in \mathbb{R}^d, \|w\| \le R} E\left[\sum_{t=1}^{T} l_t(w)\right] = O\left(\frac{d^3R^2(B_X + B_N)^4}{\sigma^2}\log T\right)$$



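Proof. Again we assume w.l.o.g. that $T = b \cdot k$ for some $b \in \mathbb{N}^+$. Consider some
$w \in \mathbb{R}^d$ such that $\|w\|_2 \le R$. Define $\nabla_c = \nabla L_c(w_c)$, $\nabla^2_c = \nabla^2 L_c(w_c)$ and
$G = G(2d, R)$. Writing the Taylor series expansion of $L_c(w)$ around $w_c$, which is exact
since $L_c$ is quadratic, we have, with $A \circ B = \sum_{i,j} A_{ij}B_{ij}$,

$$L_c(w) = L_c(w_c) + \nabla_c^\top (w - w_c) + \tfrac{1}{2}\nabla^2_c \circ (w - w_c)(w - w_c)^\top$$

According to (4), $\nabla^2_c \succeq 2(k-d+1)\sigma^2 I$ and we have,

$$L_c(w) \ge L_c(w_c) + \nabla_c^\top (w - w_c) + (k-d+1)\sigma^2 \|w - w_c\|_2^2 \qquad (6)$$

Following the analysis in [8, 9] we upper bound $\nabla_c^\top (w_c - w)$ by,

$$2\nabla_c^\top (w_c - w) \le \frac{\|w_c - w\|^2 - \|w_{c+1} - w\|^2}{\eta_{c+1}} + \eta_{c+1}G^2 \qquad (7)$$

Summing (6) over all $c$ and using (7) with $\eta_c = \frac{1}{Hc}$ we have,

$$2\sum_{c=1}^{T/k} \big( L_c(w_c) - L_c(w) \big) \le \sum_{c=1}^{T/k} \|w_c - w\|^2 \big( H(c+1) - Hc - 2(k-d+1)\sigma^2 \big) + H\|w_1 - w\|^2 + G^2 \sum_{c=1}^{T/k} \frac{1}{Hc}$$

Plugging $H = d\sigma^2$ and $k = 2d$, every coefficient in the first sum is negative and the
second term is a constant independent of $T$, which yields

$$\sum_{c=1}^{T/k} L_c(w_c) - L_c(w) = O\left( \frac{G^2}{d\sigma^2}\log T \right)$$

The theorem now follows from Lemma 1 and plugging $G = G(2d, R)$. □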



In order to prove Theorem 2 we need two simple claims first. In what follows
we assume that $L_c(w)$ is $\alpha$-exp-concave.



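Claim 1. 1. For $i < c$, $L_c(w_c) - L_c(w_c^{(i)}) \le \alpha^{-1}\big( \ln \hat{p}_{c+1}^{(i)} - \ln \hat{p}_c^{(i)} + \frac{2}{c} \big)$.
2. $L_c(w_c) - L_c(w_c^{(c)}) \le \alpha^{-1}\big( \ln \hat{p}_{c+1}^{(c)} + \ln c \big)$.

Proof. Using the $\alpha$-exp-concavity of $L_c$ we have

$$e^{-\alpha L_c(w_c)} = e^{-\alpha L_c\left( \sum_{j=1}^{c} p_c^{(j)} w_c^{(j)} \right)} \ge \sum_{j=1}^{c} p_c^{(j)} e^{-\alpha L_c(w_c^{(j)})}$$

Taking logarithm,

$$-\alpha L_c(w_c) \ge \ln \sum_{j=1}^{c} p_c^{(j)} e^{-\alpha L_c(w_c^{(j)})}$$

Thus,

$$L_c(w_c) - L_c(w_c^{(i)}) \le \alpha^{-1}\left( \ln e^{-\alpha L_c(w_c^{(i)})} - \ln \sum_{j=1}^{c} p_c^{(j)} e^{-\alpha L_c(w_c^{(j)})} \right)$$
$$= \alpha^{-1} \ln \left( \frac{1}{p_c^{(i)}} \cdot \frac{p_c^{(i)} e^{-\alpha L_c(w_c^{(i)})}}{\sum_{j=1}^{c} p_c^{(j)} e^{-\alpha L_c(w_c^{(j)})}} \right) = \alpha^{-1} \ln \frac{\hat{p}_{c+1}^{(i)}}{p_c^{(i)}} \qquad (8)$$

Now, by definition it holds that for $i < c$, $p_c^{(i)} = (1 - 1/c)\hat{p}_c^{(i)}$. Also, $p_c^{(c)} = 1/c$.
Plugging these two equalities into (8), and using $-\ln(1 - 1/c) \le 2/c$ for $c \ge 2$, yields the claim. □

Claim 2. For any two integers $r, s$ such that $s > r$, it holds that

$$\sum_{c=r}^{s} L_c(w_c) - L_c(w_c^{(r)}) \le \frac{2}{\alpha}\ln T$$

Proof. Using the previous claim we have,

$$\sum_{c=r}^{s} L_c(w_c) - L_c(w_c^{(r)}) = \big( L_r(w_r) - L_r(w_r^{(r)}) \big) + \sum_{c=r+1}^{s} L_c(w_c) - L_c(w_c^{(r)})$$
$$\le \alpha^{-1}\left( \ln \hat{p}_{r+1}^{(r)} + \ln r + \sum_{c=r+1}^{s} \Big( \ln \hat{p}_{c+1}^{(r)} - \ln \hat{p}_c^{(r)} + \frac{2}{c} \Big) \right)$$
$$= \alpha^{-1}\left( \ln r + \ln \hat{p}_{s+1}^{(r)} + \sum_{c=r+1}^{s} \frac{2}{c} \right)$$

Since $\hat{p}_{s+1}^{(r)} \le 1$, $\ln \hat{p}_{s+1}^{(r)} \le 0$. This implies that the regret is bounded by $\frac{2}{\alpha}\ln T$. □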
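The last step of the proof of Claim 2 compresses a short calculation. The following sketch spells it out, assuming $1 \le r < s \le T$ and bounding the harmonic-type sum by an integral.

```latex
\begin{align*}
\ln r + \sum_{c=r+1}^{s} \frac{2}{c}
  &\le \ln r + 2\int_{r}^{s} \frac{dx}{x}
   = \ln r + 2\ln\frac{s}{r} \\
  &= 2\ln s - \ln r
   \le 2\ln s
   \le 2\ln T .
\end{align*}
```

Each term $1/c$ is at most $\int_{c-1}^{c} dx/x$ because $1/x$ is decreasing, which gives the first inequality. Multiplying by $\alpha^{-1}$ gives the stated bound.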

We can now prove Theorem 2.

Theorem 4. Let $w_t$ be the filter used by Algorithm 2 for prediction at time $t$. Let
$k = 2d$ and let $\alpha = \frac{[\ldots]}{G(2d,R)^2}$. For all $I = [r, s] \subseteq [T]$, Algorithm 2 achieves the
following regret bound,

$$E\left[\sum_{t=r}^{s} l_t(w_t)\right] - \min_{w \in \mathbb{R}^d, \|w\| \le R} E\left[\sum_{t=r}^{s} l_t(w)\right] = O\big([\ldots]\log T\big)$$

Proof. Given an interval $I = [r, s] \subseteq [T]$, let $r = c_r \cdot k - b_r$, $s = c_s \cdot k + b_s$ such
that $c_r, b_r, c_s, b_s \in \mathbb{N}$ and $0 \le b_r, b_s \le k - 1$.
Since $k$ is a constant independent of $T$, we ignore the first $b_r$ iterations and last $b_s$
iterations, since they only add a constant to the regret.
According to Claim 2 we have,

$$\sum_{c=c_r}^{c_s} L_c(w_c) - L_c(w_c^{(c_r)}) \le \frac{2}{\alpha}\ln T$$

Since $E^{c_r}$ achieves low regret on all block-intervals beginning in block $c_r$, we have
for all $w \in \mathbb{R}^d$ such that $\|w\|_2 \le R$,

$$E\left[ \sum_{c=c_r}^{c_s} L_c(w_c^{(c_r)}) - L_c(w) \right] = O\left( \frac{G^2}{d\sigma^2}\log T \right)$$

Thus we have,

$$E\left[ \sum_{c=c_r}^{c_s} L_c(w_c) - L_c(w) \right] = O\left( \Big( \frac{1}{\alpha} + \frac{G^2}{d\sigma^2} \Big)\log T \right)$$

Again, the theorem now follows from Lemma 1 and plugging $G = G(2d, R)$. □